On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training
In this paper, we explore an improved framework to train a monoaural neural
enhancement model for robust speech recognition. The designed training
framework extends the existing mixture invariant training criterion to exploit
both unpaired clean speech and real noisy data. We find that the unpaired
clean speech is crucial for improving the quality of speech separated from real
noisy speech. The proposed method also remixes the processed and unprocessed
signals to alleviate processing artifacts. Experiments on the
single-channel CHiME-3 real test sets show that the proposed method
significantly improves speech recognition performance over enhancement
systems trained either on mismatched simulated data in a supervised fashion
or on matched real data in an unsupervised fashion. The proposed system
achieves a relative WER reduction of between 16% and 39% over the
unprocessed signal, using end-to-end and hybrid acoustic models without
retraining on distorted data.
Comment: Accepted to INTERSPEECH 202
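The abstract above builds on the mixture invariant training (MixIT) criterion. The following is a minimal Python sketch of that baseline criterion, not the authors' implementation. It assumes a separation model (not shown) that receives the sum of two mixtures and returns M source estimates; the loss searches for the assignment of estimates to mixtures that best reconstructs each mixture. Mean squared error stands in for the SNR-based loss usually used with MixIT, and the names mixit_loss, remix_with_input and the weight alpha are hypothetical. The paper's extension with unpaired clean speech is not reproduced here.

```python
import itertools

import numpy as np


def mixit_loss(mixtures, estimates):
    """Baseline MixIT loss (illustrative sketch, MSE instead of SNR).

    mixtures:  array of shape (2, T), the two reference mixtures.
    estimates: array of shape (M, T), sources estimated from mixtures.sum(0).
    Returns the minimum reconstruction loss and the best assignment, a tuple
    whose entry i is the index of the mixture that estimate i is assigned to.
    """
    n_mix, n_src = mixtures.shape[0], estimates.shape[0]
    best_loss, best_assign = np.inf, None
    # Every estimate is assigned to exactly one mixture: n_mix ** n_src options.
    for assign in itertools.product(range(n_mix), repeat=n_src):
        remix = np.zeros_like(mixtures)
        for src, mix in enumerate(assign):
            remix[mix] += estimates[src]
        loss = float(np.mean((mixtures - remix) ** 2))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


def remix_with_input(enhanced, noisy, alpha=0.8):
    """Blend processed and unprocessed signals.

    alpha is a hypothetical weight chosen for illustration; the abstract
    does not state how the remixing is weighted.
    """
    return alpha * enhanced + (1.0 - alpha) * noisy
```

The exhaustive search over assignments is only practical for a small number of estimated sources, which is the usual setting for this criterion.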
On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
This paper introduces a new method for multi-channel time domain speech
separation in reverberant environments. A fully-convolutional neural network
structure is used to separate speech directly from multiple microphone
recordings, without the need for conventional spatial feature extraction. To reduce
the influence of reverberation on spatial feature extraction, a dereverberation
pre-processing method is applied to further improve the separation
performance. A spatialized version of the wsj0-2mix dataset has been simulated to
evaluate the proposed system. Both source separation and speech recognition
performance of the separated signals have been evaluated objectively.
Experiments show that the proposed fully-convolutional network improves the
source separation metric and the word error rate (WER) by more than 13% and 50%
relative, respectively, over a reference system with conventional features.
Applying dereverberation as pre-processing to the proposed system can further
reduce the WER by 29% relative using an acoustic model trained on clean and
reverberated data.
Comment: Presented at IEEE ICASSP 202
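The relative figures quoted in the abstract above follow the usual definition of relative WER reduction: the drop in WER divided by the baseline WER. The short Python sketch below shows the arithmetic. The function name relative_wer_reduction and the WER values in the example are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer * 100.0


# Hypothetical numbers for illustration only: a baseline at 20.0% WER
# and a system at 10.0% WER give a 50% relative reduction.
print(relative_wer_reduction(20.0, 10.0))  # 50.0
```

A relative reduction says nothing about absolute WER on its own, so the same percentage can correspond to very different absolute gains depending on the baseline.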